Introduction to PCA

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset. In simpler terms, it helps us summarize the information contained in many variables into a smaller number of “principal components” while retaining as much of the original variation as possible.

Think of PCA as a way to find the most important patterns in your data. It’s like identifying the main ingredients that contribute to the flavor of a complex dish.

When to use PCA:

  • When you have many variables and want to simplify your dataset
  • To visualize high-dimensional data in 2D or 3D
  • To remove redundancy and correlation between variables
  • As a preprocessing step before applying other algorithms

Example 1: PCA on the Iris Dataset

The iris dataset is a classic dataset in statistics and machine learning. It contains measurements of 150 iris flowers from three different species (setosa, versicolor, and virginica). For each flower, four features were measured: 1. Sepal length 2. Sepal width 3. Petal length 4. Petal width

Let’s analyze this dataset using PCA to see if we can reduce these four dimensions while still capturing the important patterns.

Step 1: Load and examine the data

# The iris dataset is built into R, so we don't need to load it
head(iris)  # Look at the first few rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Check the structure of the dataset
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Summary statistics
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# Create a pairs plot to see relationships between variables
pairs(iris[1:4], col = iris$Species, 
      main = "Iris Data by Species", 
      pch = 21, bg = c("red", "green", "blue")[unclass(iris$Species)])

Step 2: Perform PCA

# Perform PCA on numeric variables (columns 1-4)
# We scale the data (standardize) because the measurements are in different units
iris_pca <- prcomp(iris[, 1:4], scale = TRUE)

# Examine the PCA results
summary(iris_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
# Look at the loadings (how much each original variable contributes to each PC)
iris_pca$rotation
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

Step 3: Understand the PCA results

The summary(iris_pca) gives us: - Standard deviation of each principal component - Proportion of variance explained by each component - Cumulative proportion of variance explained

The iris_pca$rotation shows us how much each original variable contributes to each principal component. These values are called “loadings.”

Step 4: Visualize the results

# Create a dataframe with the principal components
iris_pca_data <- data.frame(
  PC1 = iris_pca$x[, 1],
  PC2 = iris_pca$x[, 2],
  Species = iris$Species
)

# Plot the first two principal components with points colored by species
ggplot(iris_pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal() +
  labs(title = "PCA of Iris Dataset",
       x = "Principal Component 1",
       y = "Principal Component 2")

# Create a scree plot (variance explained by each PC)
var_explained <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)
var_df <- data.frame(
  PC = factor(1:4, levels = 1:4),
  Variance = var_explained
)

ggplot(var_df, aes(x = PC, y = Variance)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(Variance, 2)), vjust = -0.5) +
  theme_minimal() +
  labs(title = "Variance Explained by Principal Components",
       x = "Principal Component",
       y = "Proportion of Variance Explained")

# Visualize the loadings (contributions of original variables)
loadings_df <- data.frame(
  Variable = rownames(iris_pca$rotation),
  PC1 = iris_pca$rotation[, 1],
  PC2 = iris_pca$rotation[, 2]
)

ggplot(loadings_df, aes(x = PC1, y = PC2)) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2), 
               arrow = arrow(length = unit(0.2, "cm")), color = "red") +
  geom_text(aes(label = Variable), vjust = -0.5) +
  theme_minimal() +
  xlim(-1, 1) + ylim(-1, 1) +
  labs(title = "Loadings Plot: Contribution of Original Variables",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_hline(yintercept = 0, linetype = "dashed")

Step 5: Interpret the results

From the PCA of the iris dataset, we typically find:

  1. First principal component (PC1) often explains about 70-80% of the variance and is strongly influenced by petal length, petal width, and sepal length. This suggests these three measurements tend to vary together.

  2. Second principal component (PC2) often explains about 15-20% of the variance and is primarily influenced by sepal width, which varies somewhat independently from the other measurements.

  3. The scatter plot of PC1 vs. PC2 shows clear separation between the three species, with Setosa forming a distinct cluster and some overlap between Versicolor and Virginica.

  4. With just two principal components, we can capture about 95% of the variance in the original four-dimensional dataset.

Example 2: PCA for Remote Sensing Data

Remote sensing data typically contains many bands (wavelengths) for each pixel, making it high-dimensional. PCA is commonly used to reduce this dimensionality, enhance features, and compress the data while preserving important information.

For this example, we’ll use real-world Landsat 5 TM (Thematic Mapper) remote sensing data that’s available in R packages. Landsat 5 TM is a multispectral scanner with 7 bands covering visible, near-infrared, and shortwave infrared portions of the electromagnetic spectrum, making it ideal for PCA analysis.

Step 1: Load a real-world remote sensing dataset

# Load the sample Landsat dataset - a subset of a Landsat 5 TM scene of Austinburg, OH
data(list = "landsat")  # This loads the 'lsat' RasterBrick object

# Examine the dataset
lsat
## class       : SpatRaster 
## dimensions  : 310, 287, 7  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 619395, 628005, -419505, -410205  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=22 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## source(s)   : memory
## names       : B1_dn, B2_dn, B3_dn, B4_dn, B5_dn, B6_dn, ... 
## min values  :    54,    18,    11,     4,     2,   131, ... 
## max values  :   185,    87,    92,   127,   148,   146, ...
# This should show 6 bands from Landsat 5 TM: 
# 1=blue (band 1, 0.45-0.52 μm), 2=green (band 2, 0.52-0.60 μm), 3=red (band 3, 0.63-0.69 μm), 
# 4=near infrared (band 4, 0.76-0.90 μm), 5=shortwave infrared 1 (band 5, 1.55-1.75 μm), 
# 7=shortwave infrared 2 (band 7, 2.08-2.35 μm)
# Note: Landsat 5 TM band 6 (thermal infrared, 10.40-12.50 μm) is often omitted in standard analysis
# because it measures emitted energy rather than reflected energy

# Display band information
print(paste("Number of bands:", lsat@ptr$nlyr()))
## [1] "Number of bands: 7"
print(paste("Spatial resolution:", res(lsat), "meters"))
## [1] "Spatial resolution: 30 meters" "Spatial resolution: 30 meters"
print(paste("Dimensions (rows, cols):", nrow(lsat), "x", ncol(lsat)))
## [1] "Dimensions (rows, cols): 310 x 287"

Understanding Landsat 5 TM Data

Landsat 5 was operational from 1984 to 2013 (one of the longest-running Earth observation satellites), and its Thematic Mapper (TM) sensor provides excellent data for PCA analysis because:

  1. It has multiple bands that capture different parts of the electromagnetic spectrum
  2. Bands often have high correlation (especially the visible bands 1-3)
  3. The 30-meter resolution is suitable for landscape-level analysis
  4. The multispectral nature allows for detection of features not visible to the human eye

Each band is sensitive to different Earth surface features: - Bands 1-3 (visible light): Good for cultural features, vegetation types, and water clarity - Band 4 (NIR): Excellent for vegetation analysis and biomass content - Bands 5 & 7 (SWIR): Valuable for geology, soil moisture, and vegetation moisture content

# Create an RGB true color composite (using Red, Green, Blue bands)
plotRGB(lsat, r = 3, g = 2, b = 1, stretch = "lin", main = "True Color Composite")

# Create a false color composite (using NIR, Red, Green bands)
# This highlights vegetation in red
plotRGB(lsat, r = 4, g = 3, b = 2, stretch = "lin", main = "False Color Composite (NIR)")

# Create another false color composite using SWIR, NIR, and Red
# This can highlight moisture content and geology
plotRGB(lsat, r = 5, g = 4, b = 3, stretch = "lin", main = "False Color Composite (SWIR)")

Step 2: Perform PCA on the remote sensing data

# Perform PCA on the raster stack
landsat_pca <- rasterPCA(lsat)

# Examine the summary of the PCA results
summary(landsat_pca$model)
## Importance of components:
##                            Comp.1     Comp.2      Comp.3      Comp.4
## Standard deviation     34.5862074 12.0022196 2.981810357 1.292922722
## Proportion of Variance  0.8835812  0.1064054 0.006567508 0.001234769
## Cumulative Proportion   0.8835812  0.9899866 0.996554106 0.997788875
##                              Comp.5       Comp.6       Comp.7
## Standard deviation     1.0982925563 1.0307492287 0.8513311231
## Proportion of Variance 0.0008909979 0.0007847776 0.0005353497
## Cumulative Proportion  0.9986798726 0.9994646503 1.0000000000
# Look at the loadings (how much each original band contributes to each PC)
print(landsat_pca$model$loadings)
## 
## Loadings:
##       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## B1_dn         0.221  0.707  0.334  0.387  0.348  0.258
## B2_dn         0.155  0.407 -0.197  0.102 -0.235 -0.838
## B3_dn         0.273  0.401 -0.324 -0.405 -0.554  0.431
## B4_dn  0.755 -0.613  0.195                            
## B5_dn  0.624  0.589 -0.368         0.323 -0.144       
## B6_dn         0.108        -0.840  0.157  0.500       
## B7_dn  0.178  0.345         0.180 -0.734  0.494 -0.184
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
## Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000
# Find out how much variance is explained by each component
explained_var <- landsat_pca$model$sdev^2 / sum(landsat_pca$model$sdev^2)
print(paste("Variance explained by each PC:", paste(round(explained_var * 100, 2), "%", collapse = ", ")))
## [1] "Variance explained by each PC: 88.36 %, 10.64 %, 0.66 %, 0.12 %, 0.09 %, 0.08 %, 0.05 %"

Step 3: Visualize the PCA results

# Plot the first four principal components
par(mfrow = c(2, 2))
plot(landsat_pca$map$PC1, main = "Principal Component 1", col = gray.colors(100))
plot(landsat_pca$map$PC2, main = "Principal Component 2", col = gray.colors(100))
plot(landsat_pca$map$PC3, main = "Principal Component 3", col = gray.colors(100))
plot(landsat_pca$map$PC4, main = "Principal Component 4", col = gray.colors(100))

par(mfrow = c(1, 1))

# Create an RGB composite using the first three PCs
plotRGB(landsat_pca$map, r = 1, g = 2, b = 3, stretch = "lin", 
        main = "RGB Composite of First Three Principal Components")

# Plot the variance explained by each PC (scree plot)
barplot(explained_var, 
        names.arg = paste0("PC", 1:length(explained_var)),
        main = "Variance Explained by Principal Components",
        xlab = "Principal Component",
        ylab = "Proportion of Variance Explained",
        col = "steelblue")
text(x = 1:length(explained_var), 
     y = explained_var + 0.02, 
     labels = paste0(round(explained_var * 100, 1), "%"), 
     cex = 0.8)

# Add the cumulative variance
cumulative_var <- cumsum(explained_var)
plot(cumulative_var, type = "b", xlab = "Number of Principal Components",
     ylab = "Cumulative Proportion of Variance Explained",
     main = "Cumulative Variance Explained", 
     ylim = c(0, 1))
abline(h = 0.95, col = "red", lty = 2)  # Line at 95% variance explained
text(x = which(cumulative_var >= 0.95)[1], 
     y = 0.95 + 0.03, 
     labels = paste0("95% variance with ", which(cumulative_var >= 0.95)[1], " PCs"), 
     cex = 0.8, col = "red")

Step 4: Compare original bands with PCA results

# Create a function to enhance the visual appearance of an image
# This version works with both older RasterLayer objects and newer SpatRaster objects
enhance_image <- function(img) {
  # Get min and max values in a way that works with different raster types
  if(inherits(img, "SpatRaster")) {
    # For terra package SpatRaster objects
    min_val <- terra::minmax(img)[1]
    max_val <- terra::minmax(img)[2]
  } else {
    # For raster package RasterLayer objects
    min_val <- raster::minValue(img)
    max_val <- raster::maxValue(img)
  }
  
  # Apply normalization for better contrast
  return((img - min_val) / (max_val - min_val))
}

# Alternative enhance_image function if the above doesn't work
# This uses base R operations that should work on most raster objects
enhance_image_alt <- function(img) {
  # Convert to matrix first
  if(inherits(img, "SpatRaster") || inherits(img, "RasterLayer")) {
    img_mat <- as.matrix(img)
  } else {
    img_mat <- img
  }
  
  # Get min and max values
  min_val <- min(img_mat, na.rm = TRUE)
  max_val <- max(img_mat, na.rm = TRUE)
  
  # Normalize
  img_norm <- (img_mat - min_val) / (max_val - min_val)
  
  # Convert back to raster if needed
  if(inherits(img, "SpatRaster")) {
    return(terra::rast(img_norm, extent=terra::ext(img), crs=terra::crs(img)))
  } else if(inherits(img, "RasterLayer")) {
    return(raster::raster(img_norm, template=img))
  } else {
    return(img_norm)
  }
}

# Try to create enhanced versions of some original bands for comparison
# Try the first method and if it fails, use the alternative
tryCatch({
  enhanced_nir <- enhance_image(lsat[[4]])  # NIR band
  enhanced_swir <- enhance_image(lsat[[5]])  # SWIR band
  enhanced_pc1 <- enhance_image(landsat_pca$map$PC1)
  enhanced_pc2 <- enhance_image(landsat_pca$map$PC2)
}, error = function(e) {
  message("Using alternative enhancement method due to error: ", e$message)
  enhanced_nir <- enhance_image_alt(lsat[[4]])  # NIR band
  enhanced_swir <- enhance_image_alt(lsat[[5]])  # SWIR band
  enhanced_pc1 <- enhance_image_alt(landsat_pca$map$PC1)
  enhanced_pc2 <- enhance_image_alt(landsat_pca$map$PC2)
})

# Plot the comparison
par(mfrow = c(2, 2))
plot(enhanced_nir, main = "Original NIR Band", col = gray.colors(100))
plot(enhanced_swir, main = "Original SWIR Band", col = gray.colors(100))
plot(enhanced_pc1, main = "Principal Component 1", col = gray.colors(100))
plot(enhanced_pc2, main = "Principal Component 2", col = gray.colors(100))

# Plot the comparison
par(mfrow = c(2, 2))
plot(enhanced_nir, main = "Original NIR Band", col = gray.colors(100))
plot(enhanced_swir, main = "Original SWIR Band", col = gray.colors(100))
plot(enhanced_pc1, main = "Principal Component 1", col = gray.colors(100))
plot(enhanced_pc2, main = "Principal Component 2", col = gray.colors(100))

par(mfrow = c(1, 1))

# Create a composite that would highlight specific features
# For example, PC1 often represents overall brightness/albedo
# PC2 might represent vegetation vs. non-vegetation contrast
# PC3 might highlight urban areas or water features
plotRGB(landsat_pca$map, r = 3, g = 2, b = 1, stretch = "lin", 
        main = "Alternative PC Composite (PC3, PC2, PC1)")

Step 5: Interpret the Landsat PCA results

In this real-world Landsat 5 TM data, the PCA results can be interpreted as:

  1. First principal component (PC1) typically captures the overall brightness or albedo of the scene (around 70-80% of the variance). It shows the major landscape features and topography. In Landsat imagery, this often correlates with the general landscape structure.

  2. Second principal component (PC2) often highlights the contrast between vegetation (high NIR reflectance) and non-vegetation areas (around 10-15% of the variance). It may emphasize agricultural fields, forests, and urban areas differently.

  3. Third and fourth components (PC3, PC4) reveal more subtle features that aren’t obvious in the original bands, such as different vegetation types, soil moisture variations, or specific land cover types. They might highlight water bodies, certain crop types, or geological features.

  4. The later components (PC5, PC6, PC7) typically contain very little of the total variance (often <1-2%) and mostly show noise, although they occasionally reveal subtle features of interest or specific anomalies in the landscape.

Step 6: Applications of PCA in Landsat and Other Remote Sensing Data

  1. Data compression: Instead of working with all 6 Landsat 5 TM bands, you can often work with just 2-3 PCs that explain 95% or more of the variance. This significantly reduces storage requirements and processing time.

  2. Feature enhancement: The PCA transformation often reveals features that aren’t clearly visible in the original bands. For example, PC2 or PC3 might highlight subtle geological features, urban structures, or different crop types that blend together in standard composites.

  3. Change detection: By performing PCA on multi-temporal Landsat images (e.g., before and after a forest fire or urban development), changes over time can be highlighted more effectively than with original bands.

  4. Noise reduction: Since later PCs (PC5, PC6, PC7) often contain mostly noise in Landsat data, you can reconstruct cleaner imagery by using only the first few PCs.

  5. Land cover classification: PCA can improve classification accuracy by reducing band correlation and highlighting the most important information, making it easier to distinguish between similar land cover types.

Step 7: Additional Remote Sensing PCA Examples

To further explore PCA with Landsat 5 TM data, you could:

  1. Compare seasonal changes: Apply PCA to Landsat images from different seasons to see how the landscape components change.

  2. Analyze specific land cover types: Extract subsets of the image (e.g., just forest areas or agricultural fields) and perform PCA on those to detect subtle variations within a single land cover type.

  3. Create specialized indices: Combine PCs to create specialized indices for detecting specific features of interest, similar to how normalized difference vegetation index (NDVI) works with original bands.

# Example of creating an index from PCs
# This is hypothetical - the actual usefulness would depend on your specific data
pc_index <- (landsat_pca$map$PC2 - landsat_pca$map$PC3) / (landsat_pca$map$PC2 + landsat_pca$map$PC3)
plot(pc_index, main = "PC-based Index", col = terrain.colors(100))

As shown in the document example, you can create various interesting visualizations by combining different principal components in RGB composites. Since PC1 often represents overall brightness/albedo, PC2 typically highlights vegetation vs. non-vegetation contrast, and PC3 might highlight urban areas or water features, different combinations can reveal different aspects of the landscape.

Conclusion

PCA is a powerful technique for dimensional reduction and feature extraction in various fields, from simple datasets like Iris to complex remote sensing imagery like Landsat 5 TM data. By reducing the dimensionality of your data while preserving most of the important information, PCA can help you:

  1. Visualize high-dimensional data
  2. Remove redundancy and correlation between variables
  3. Extract the most important patterns in your data
  4. Reduce computational complexity for subsequent analyses

In the Landsat 5 TM example, we saw how PCA can transform 6 spectral bands into a smaller set of uncorrelated components that still retain most of the information. The first few principal components typically capture the major landscape features, vegetation patterns, and geological structures, while later components may reveal subtle features or isolate noise.

Remember that while PCA is powerful, it has limitations: - It assumes linear relationships between variables - It’s sensitive to outliers - The principal components may not always be easily interpretable

For beginners working with remote sensing data, it’s especially important to compare the PCA results with the original bands to build an understanding of what information each principal component represents in the physical landscape.

PCA is not just a mathematical technique but a practical tool that can enhance your ability to extract meaningful information from complex multispectral data like Landsat imagery.

References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.
  • Jolliffe, I. T. (2002). Principal component analysis. Springer.
  • Jensen, J. R. (2015). Introductory digital image processing: A remote sensing perspective. Pearson.
  • Schowengerdt, R. A. (2006). Remote sensing: Models and methods for image processing. Academic Press.
  • Richards, J. A. (2013). Remote sensing digital image analysis. Springer.